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We demonstrate that the recently proposed pruned-enriched Rosenbluth method PERM (P. Grass- 
berger, Phys. Rev. E, in press (1997) ) leads to extremely efficient algorithms for the folding of 
simple model proteins. We test it on several models for lattice heteropolymers, and compare to 
published Monte Carlo studies of the properties of particular sequences. In all cases our method is 
faster than the previous ones, and in several cases we find new minimal energy states. In addition to 
producing more reliable candidates for ground states, our method gives detailed information about 
the thermal spectrum and, thus, allows to analyze static aspects of the folding behavior of arbitrary 
sequences. 



INTRODUCTION 



Protein folding Jy-|4| is one of the most interesting and 
challenging problems in polymer physics and mathemat- 
ical biology. It is concerned with the problem of how a 
heteropolymer of a given sequence of amino acids folds 
into precisely that geometrical shape in which it performs 
its biological function as a molecular machine [^|^]. Cur- 
rently, it is much simpler to find coding DNA — and, 
thus, also amino acid — sequences than to elucidate the 
3-d structures of given proteins. Therefore, solving the 
protein folding problem would be a major break-through 
in understanding the biochemistry of the cell, and, fur- 
thermore, in designing artificial proteins. 

In this contribution we are concerned with the direct 
approach: given a sequence of amino acids, a molecu- 
lar potential, and no other information, find the ground 
state and the equilibrium state at physiological temper- 
atures. Note that we are not concerned with the kinetics 
of folding, but only in the final outcome. Also, we will 
not address the problems of how to find good molecular 
potentials and what is the proper level of detail 

in describing proteins J8|. Instead, we will use simple 
coarse-grained models which have been proposed in the 
literature and have become standards in testing the effi- 
ciency of folding algorithms. 

A plethora of methods have been proposed to solve this 
problem, ranging from simple Metropolis Monte Carlo 
simulations at some nonzero temperature Jl0| over multi- 
canonical simulation approaches [ pT| to stochastic op- 
timization schemes based, e.g., on simulated annealing 
p2| , and genetic algorithms . Alternative methods 

use heuristic principles jig ], information from databases 
of known protein structures, p6| , sometimes in combi- 
nation with known physico-chemical properties of small 
peptides. 

The algorithms we apply here are variants of the 



pruned-enriched Rosenbluth method (PERM) 0. This 
is a chain growth approach based on the Rosenbluth- 
Rosenbluth (RR) |L8| method. Preliminary results have 
been published before ]l9| ]. Here we will provide more 
details on the algorithm and on the analyses that can be 
performed, and we will present more detailed results on 
ground state and spectral properties, and on the folding 
behavior of the sequences analyzed. 



THE MODELS 

The models we study in this contribution are het- 
eropolymers that live on 2- and 3-dimensional regular 
lattices. They are self-avoiding chains with attractive or 
repulsive interactions between neighboring non-bonded 
monomers. 

The majority of authors considered only two kinds of 
monomers. Although also different interpretations are 
possible for such a binary choice, e.g. in terms of positive 
and negative electric charges (20), the most important 
model of this class is the HP model pl| , p2| . There, the 
two monomer types are assumed to be hydrophobic (H) 
and polar (P), with energies €hh = —1, (hp = (pp — 
for interaction between not covalently bound neigh- 
bors. Since this parameter set leads to highly degener- 
ate ground states, alternative parameters were proposed, 
e.g. e= (-3,-1,-3) || and e = (-1,0,-1) @. Note, 
however, that in these latter parameter sets, since they 
are symmetric upon exchange of H and P, the intuitive 
distinction between hydrophilic and polar monomers gets 
lost. 

An interesting extension to the HP model can be ob- 
tained by allowing the interactions to be anisotropic. 
This is done by introducing amphipatic (A) monomers 
that have hydrophobic as well as polar sides pq |. Such 
a generalization is possible for all lattice types, but we 
confine ourselves to two dimensions (2d) here. It can be 



1 



shown that in this HAP model for a wide range of inter- 
action parameters the inverse folding problem — i.e. the 
determination of a sequence that has a particular confor- 
mation as ground state — can be solved by construction 
p5j. While 3-dimensional chiral amphipatic monomers 
can be considered as well as non-chiral (the sides are 
allowed to rotate in the latter), in 2d only non-chiral 
monomers are possible. 

In the other extremal case of models, all monomers of 
a sequence arc considered to be different, and interaction 
energies are drawn randomly from a continuous distri- 



bution [£6|,g7|]. These models correspond, effectively, to 
assuming an infinite number of monomer types. 

THE SEQUENCES 

For the above models various sequences were analyzed 



in the literature, and in |19| we took these analyses as a 
test for our algorithm. Here we will take a closer look at 
the properties of some of the sequences that were consid- 
ered there. 



TABLE I. Newly found lowest energy states for binary sequences with interactions e = (ehh, cup, epp). Configurations are 



encoded as sequences of r(ight),/(eft), tt(p), rf(own), /(orward), and b(ackward). 

N d e sequence old -E m i n Ref. 

configuration E 

100 2 (-1,0,0) PeHPHzPsHsPHsPHzPziPzHzhPHsPHiaPHiPHrPuHTPiHPHsPeHPHi =44 @~ 

r%ur2U3rdzluldl2drd2ru2rs{rulu)2urdrd2ruilurzdld2rur^dslzuldl2ds,ru2rs,dzl2urul —47 
rdldldrd2r2d-il2drdldr2dl2dl2(urul)2urur2ul2U2l2drd2lul2uru2r2Uirul3drds,l2d2 — 

Idlu2ru2lu$rd2rdr —47 

Uzrzur^dl^drd2r2ulur^dr2dld2lu2luld2rdl2ulzdr2dr2drurdrzd2luld2lu2l2drd2luluzl2ul2uzru —47 
rdrzdldlzdrdrur2ur2ulur2urd2(ldrd)2l2U2ldl2dr2drzdl2dlulul2ul2dzldrzU2rdrdldl — 

dr2ururiUzrul —47 

rdzrdldl2uldzld2rurzdl2dr6ulzU2lzurzur2dld2rur2ulu2l2Uzluzr2d$r2d2rdru2lu2r2U2ldl2dl —47 
rzuzru2ruzl2ur2ul2urul4dr2dldrdld2rur-2dld2luld2rdlzuru2ldldzlzururu2rur2ulu— 

ldldl2uzldir —47 

100 2 (-1,0,0) ^IF^I^IF^PlF^T?^ =46 @~ 

ru2ldlu2ld2lu2luruluT2d2ru4X2dld4TUzrdrdld( > rdruzrul2U2Tdr2ululu2 {rd s )zTuridldld/J,ului.ru —49 

uzrdru2rd2ru2r2U2ldluldlu 5 ldf,l2d2luzr2UQhdzldr2dl2dldlu2ru2ldl2drdl2d2rurdrdzruzru —49 

ul2drdl2Uzld4ldrdl2U2hdzhuruzr2Uzrdzru4rul5dldr2d2luldldrdldluzlul2ulur2dr2Uzrd4,l —49 

60 2 (-1,0,0) P2H3PH8P3H10PHP3H12P4H6PH2PHP =34 

T^d2lulzdld2(ru)2Td2ldldrdr2uluru2rd2rdldr2Uzluzrd2rur —36 

80 3 (-1,0,-1) PH2Pz(HzP2HzPzH2PzhH i P i (HzP2HzPzH 2 Pz)H2 =94 ||f§ 

Ibruflbl2br2drur2dldlzulfrdrzurfldlzulururzdrblulzbrzblzdldrdrzurul2dlu —98 



2d HP model 

Two-dimensional HP chains were used in several pa- 
pers as test cases for folding algorithms. We shall discuss 
the following ones: 

(a) Several chains of length 20 to 64 were studied in |b| 
by means of a genetic algorithm. These authors quote 
supposedly exact ground state energies, and lowest en- 
ergies obtained by simulations. While these coincide for 
the shorter chains (N < 50), the authors were unable to 
fold the longer chains with N = 60 and 64. 

(b) Two chains with N = 100 were studied in p8[ . 
The authors claimed that their native configurations were 
compact, fitting exactly into a 10 x 10 square, and had 
energies —44 and —46, see Table I for the sequences and 
Fig. and || for the respective proposed ground state 
structures. These conformations were found by a spe- 
cially designed MC algorithm which should be particu- 
larly efficient for compact configurations. 




FIG. 1. Putative compact native structure of sequence 1 
from Table I (E = -45) according to [|§|; (filled circle) H 
monomers, (open circle) P monomers. 
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The authors tried to find the lowest energy states with 
two different methods, one being an heuristic stochastic 
approach |lq], the other based on exact enumeration of 



FIG. 2. Putative compact native structure of sequence 2 
from Table I (E = -46) according to pi. 



2d HAP model 

To analyze the performance of PERM on the HAP 
model we used e = (—4,-1,2) as energy parame- 
ters, for which set the inverse folding problem is solv- 
able ] |25fl . We choose a 3- helix structure with N = 42, 
see Fig. ||, which is the ground state of the se- 
quence pA(HAAH) 3 PA 3 H w A 3 P(HAAH) 3 Ap, where A 
denotes an amphipatic intra-chain group with one hy- 
drophobic and one polar side, while p denotes an am- 
phipatic end group with one hydrophobic and two polar 
sides. 




FIG. 3. Ground state structure of the HAP sequence; 
(filled circle) H monomers, (structured circle) A monomers, 
(open circle) P monomers. 



3d HP model 

Ten sequences of length N — 48 were given in pj[ . 
Each of these sequences was designed by minimizing the 
energy of a particular target conformation in sequence 
space under the constraint of constant composition [|3ll . 



low energy states 1 32 . With the first method they failed 
in all but one case to reach the lowest energy. With the 
second method in all but one cases they obtained con- 
formations with energies that were even lower than the 
putative ground states the sequences were designed for, 
while for one case the ground state energy was confirmed. 
Precise CPU times were not quoted. 



3d modified HP model 



A most interesting case is a 2-species 80-mer with in- 
teractions (—1,0, —1) studied first in |E4). These partic- 
ular interactions were chosen instead of the HP choice 
(—1,0,0) because it was hoped that this would lead to 
compact configurations. Indeed, the sequence was spe- 
cially designed to form a "four helix bundle" which fits 
perfectly into a 4 x 4 x 5 box, see Fig. [|. Its energy in 
this putative native state is —94. Although the authors 
of [^4j used highly optimized codes, they were not able 
to recover this state by MC. Instead, they reached only 
E = —91. Supposedly, a different state with E = — 94 
was found in p8j, but figure 10 of this paper, which is 
claimed to show this configuration, has a much higher 
value of E. Configurations with E = — 94 but slightly 
different from that in |24| and with E = —95 were found 
in [^9| by means of an algorithm similar to that in |p8| . 
For each of these low energy states the author needed 
about one week of CPU time on a Pentium. 




FIG. 4. Putative native state of the "four helix bundle" 
sequence, see Table I, as proposed by O 'Toole et al.. It has 
E — —94, fits into a rectangular box, and consists of three ho- 
mogeneous layers. Structurally, it can be interpreted as four 
hefix bundies. 
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3d oo monomer types 

Sequences with N — 27 and with continuous inter- 
actions were studied in p^| . Interaction strengths were 
sampled from Gaussians with fixed non-zero mean and 
fixed variance. These N(N — l)/2 numbers were first at- 
tributed randomly to the monomer pairs, then they were 
randomly permuted, using a Metropolis accept/reject 
strategy with a suitable cost function, to obtain good 
folders. Such "breeding" strategies to obtain good fold- 
ers were also developed and employed by other authors 
for various models @||,||, and seem necessary to elim- 
inate sequences which fold too slowly and/or unreliably. 
It is believed that also during biological evolution opti- 
mization processes took place with similar effects, so that 
actual proteins are better folders than random amino se- 
quences. 



THE ALGORITHM 

The algorithms we apply here are variants of the 
pruned-enriched Rosenbluth method (PERM) [jl7|], a 
chain growth algorithm based on the Rosenbluth- 
Rosenbluth (RR) [Q method. There, monomers are 
placed sequentially at vacant sites either with uniform 
probability, or with some non-uniform probability distri- 
bution. In cither case it leads to weighted samples where 
each configuration carries a weight W . For long chains 
or low temperatures, the spread in weights can become 
very wide which then leads to serious problems |35j . But 
since the weights accumulate as the chains grow, one can 
interfere during the growth process by 'pruning' config- 
urations with low weights and replacing them by copies 
of high- weight configurations. This is in principle simi- 
lar to population based methods in polymer simulations 
US and in quantum Monte Carlo (MC) |§, but the 
implementation is different. Pruning is done stochasti- 
cally: if the weight of a configuration has decreased be- 
low a threshold W < , it is eliminated with probability 1 /2, 
while it is kept and its weight is doubled in the other half 
of cases. Copying ('enrichment' (39]) is done indepen- 
dently of this. If W increases above another threshold 
W > , the configuration is replaced by n copies, each with 
weight W/n. Technically, this is done by putting onto a 
stack all information needed about configurations which 
still have to be copied. This is most easily implemented 
by recursive function calls. Thereby one avoids the need 
for keeping large populations of configurations |36|-[38||. 
PERM has proven extremely efficient for studies of lat- 
tice homopolymers near the 9 point G^] . It has also been 
successfully applied to phase equilibria E^] , to the order- 
ing transition in semi-stiff polymers E| , and to spiraling 
transitions of polymers with interactions depending on 
relative orientation of monomers Eal. We refer to these 



papers for more detailed descriptions of the basic algo- 
rithm. 

The main freedom when applying PERM consists in 
the a priori choice of the sites where to place the next 
monomer, in the thresholds W K and W > for pruning and 
copying, and in the number of copies made each time. All 
these features do not affect the formal correctness of the 
algorithm, but they can greatly influence its efficiency. 
They may depend arbitrarily on chain lengths and on 
local configurations, and they can be changed freely at 
any time during the simulation. Thus the algorithm can 
'learn' during the simulation. 

In order to apply PERM to heteropolymers at very low 
temperatures, the strategies proposed in fl7j] are modified 
as follows. 

(1) For homopolymers near the theta-point it had been 
found that the best choice for the placement of monomers 
was not according to their Boltzmann weights, but uni- 
formly on all allowed sites Jl7],[40| . This might be surpris- 
ing since the Boltzmann factor has then to be included 
into the weight of the configuration, which might lead to 
large fluctuations. Obviously, this effect is counterbal- 
anced by the fact that larger Boltzmann factors corre- 
spond to higher densities and thus to smaller Rosenbluth 
factors 

For a heteropolymer this has to be modified, as there 
is no longer a unique relationship between density and 
Boltzmann factor. In a strategy of 'anticipated impor- 
tance sampling' we should preferentially place monomers 
on sites with mostly attractive neighbors. Assume that 
we have two kinds of monomers, and we want to place a 
type-^4 monomer. If an allowed site has ms neighbors of 
type B (B — H,P), we select this site with a probability 
oc 1 + aAHT^H + aAprrip. Here, olab are constants with 
ciab > for €ab < and vice versa. 

(2) Most naturally, W > and W K are chosen propor- 
tional to the estimated partition sum Z n 0. This be- 
comes inefficient at very low T since Z n will be underes- 
timated as long as no low-energy state is found. When 
this finally happens, W > is too small. Thus too many 
copies are made which are all correlated but cost much 
CPU time. 

This problem can be avoided by increasing W > and 
W < during particularly successful 'tours' (a tour is the 
set of configurations derived by copying from a single 
start p7| ). But then also the average number of long 
chains is decreased in comparison with short ones. To 
reduce this effect and to create a bias towards a sample 
which is flat in chain length, we multiply by some power 
of M n /Mi, where M n is the number of generated chains 
of length n. With N(n) denoting the number of chains 
generated during the current tour we used therefore 

W< = C Z n [(1 + M{n)/M){M n + M)/(Mi + M)] 2 , 

and W > — rW K . Here, C is a constant of order unity, 
r « 10, and M is a constant of order 10 4 — 10 5 . 
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(3) Creating only one new copy at each enrichment 
event (as done in p7[ ), cannot prevent the weights from 
exploding at very low T . Thus we have to make several 
copies if the weight is large and surpasses W > substan- 
tially. A good choice for the number of new copies created 
when W > W> is mt[^/W/W>}. 

(4) Two special tricks were employed for 'compact' 
configurations of the 2-d HP model filling a square. First 
of all, since we know in this case where the boundary 
should be, we added a bias for polar monomers to actu- 
ally be on that boundary, by adding an additional energy 
of -1 per boundary site. Note that this bias has to be cor- 
rected in the weights, thus the final distributions are un- 
affected by it and unbiased. Secondly, in two dimensions 
we can immediately delete chains which cut the free do- 
main into two disjoint parts, since they never can grow 
to full length. In the present simulations, we checked 
for this by looking ahead one time step. In spite of the 
additional work this was very efficient, since it reduced 
considerably the time spent on dead-end configurations. 

(5) In some cases we did not start to grow the chain 
from one end but from a point in the middle. We grew 
first one half, and then the other. Results were aver- 
aged over all possible starting points. The idea behind 
this is that real proteins have folding nuclei 0], and it 
should be most efficient to start from such a nucleus. 
In some cases this trick was very successful and speeded 
up the ground state search substantially, in others not. 
We take this observation as an indication that in vari- 
ous sequences the end groups already provide effective 
nucleation sites. This is e.g. the case for the 80-mcr 
with modified HP interactions of fl24| . We also tried to 
grow the chain on both sides simultaneously. However it 
turned out that this is not effective computationally [Q . 

(6) In the case of the HAP model it turned out that, 
while the ground state configuration of the chain geom- 
etry could be reached easily, the ground state configura- 
tions of the side groups (i.e., of the H/P bond attributions 
of amphiphilic monomers [^|) could note be reached ef- 
fectively using PERM alone. Therefore, after building 
the chain conformation to its full length we let the side 
groups of the amphipatic monomers rotate thermally us- 
ing a Metropolis algorithm. This approach utilizes the 
short relaxation time of side group fluctuations within 
the subphase of a fixed chain conformation and leads to 
the desired ground states 

(7) For an effective sampling of low-lying states the 
choice of simulation temperature T appears to be of im- 
portance. If it is too large, low-lying states will have a low 
statistical weight and will not be sampled reliably. On 
the other hand, if T is too low, the algorithm becomes 
too greedy: configurations which look good at first sight 
but lead to dead ends are sampled too often, while low 
energy configurations, whose qualities become apparent 
only at late stages of the chain assembly, are sampled 
rarely. Of course they then get huge weights (since the 



algorithm is correct after all), but statistical fluctuations 
become huge as well. This is in complete analogy to the 
slow relaxation hampering more traditional (Metropolis 
type) simulations at low T - note, however, that "relax- 
ation" in the proper sense does not exist in the present 
algorithm. 

In the cases we considered it turned out to be most ef- 
fective to choose a temperature that is below the collapse 
transition temperature (note, however, that this transi- 
tion is smeared out, see the results below) but somewhat 
above the temperature corresponding to the structural 
transition which leads to the native state. This obser- 
vation corresponds qualitatively to the considerations of 
p6j, although a quantitative comparison appears not to 
be possible. 

(8) For 2-dimensional HP chains, we performed also 
some runs where we restricted the search for native con- 
figurations further, by disallowing non-bonded HP neigh- 
bor pairs. The idea behind this was that such a pair costs 
energy, and is thus less likely to appear in a native state. 
But this is only a weak heuristic argument. Forbidding 
such pairs certainly gives wrong thermal averages, and it 
might prevent the native state to be found, if it happens 
to contain such a pair. But in two cases this restriction 
did work, and gave states with lower energies than those 
we could reach without this trick. 



RESULTS 

Let us now discuss our results. All CPU times quote 
below refer to SPARC Ultra machines with 167 MHz. 



2d HP model 

(a) For all chains of [|l3) we easily reached the ground 
state, except for the longest chain (N = 64). For this 
chain the ground state is very regular, with all polar 
monomers on the outside, and with no non-bonded HP 
neighbors. Its energy is -42 (see fig.||). The authors of 
|l3f were unable to recover by simulations this state and 
any other state with E < —37. Although we could not 
reach E = —42 either without special tricks, we obtained 
at least E — —40 after ca. 4h CPU time. Forbidding 
non-bonded HP neighbor pairs, we found the native state 
easily, but this cannot be really counted as a success since 
we knew in advance that such pairs do not appear in the 
native state. The difficulty posed by this sequence for 
PERM is obvious from fig.^[ there is no folding center 
in this chain. No matter where one starts the assembly, 
one first has to construct a large part of the boundary 
before the interior behind this boundary is filled up. This 
intermediate state has a very low Boltzmann weight and 
acts thus as a bottleneck for PERM. 
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FIG. 5. One of the ground states of the N = 64 sequence 
of jl3|. The other ground states differ in the configurations 
of the chain ends filling the interior of the structure, but have 
the same boundary and overall shape. 

Note that the configuration of Fig.|5| cannot be obtained 
by popular local moves including, e.g., bond exchange or 
crankshaft, or by reptation. But it can be reached if 
pivot moves are included. This illustrates that folding 
properties can depend strongly on the chosen kinetics. 

In contrast to the sequence with N = 64, we had no 
problem with the shorter sequences of ]13| |. In particular, 
for the N = 60 sequence we found a configuration with 
E = —36 (see Table I), although the authors had quoted 
E = —34 as supposedly exact ground state energy. 




FIG. 6. One of the compact structures of sequence 1 with 
energy (E = —46) lower than the "native" state proposed by 
Ramakrishnan et al. 

(b) For the two HP chains of || with N = 100, see 
Table I, we found several compact states (within ca. 40 
hours of CPU time) that had energies lower than those 
of the compact putative ground states proposed in p8[ . 
Figures ^| and ^ show representative compact structures 
with E = —46 for sequence 1 and E — —47 for sequence 
2. Moreover, we found (again within 1-2 days of CPU 
time) several non-compact configurations with energies 



even lower: E = — 47 and E = — 48 for sequence 1 and 
2, respectively. Forbidding non-bonded HP pairs, we ob- 
tained even E = —49 for sequence 2. Figures || and ^ 
show representative non-compact structures with these 
energies; a non-exhaustive collection of these is listed in 




FIG. 7. One of the compact structures of sequence 2 with 
lower energy (E = —47). 




FIG. 8. One of the (non-compact) lowest energy sequences 
for sequence 1 (E — —47). 




FIG. 9. One of the (non-compact) lowest energy sequences 
for sequence 2 (E = —49). 
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Table I. These results reflect the well-known property 
that HP sequences (and those of other models) usually 
have ground states that are not maximally compact, see, 
e.g. although there is a persistent prejudice to the 
contrary MMM. 



2d HAP model 

The ground state of the HAP triple helix was found 
within several minutes of CPU time using the PERM- 
Metropolis hybrid algorithm. It was not found to be 
degenerate. 

In order to obtain information about the folding tran- 
sition, energy and contact matrix (see below) histograms 
were determined at T = 1.25, 2, and 4. Free energy differ- 
ences, necessary for combining the histograms Jt8| , were 
determined using Bennet's acceptance ratio method Jj9[. 



the specific heat near T = 1.0. Note that specific heat 
and energy fluctuations emphasize the two transitions 
differently due to the factor T 2 by which they differ. 
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FIG. 10. Number of native contacts, total number of con- 
tacts, and energy E vs temperature T for the HAP sequence 
of Fig. |. 

In Fig. ^ the thermal behavior of the mean number 
of contacts, number of native contacts, and of the mean 
energy is shown. While the structural transition to the 
ground state phase is best monitored by the number of 
native contacts and takes place around T = 1, the com- 
pactification of the polymer chain is most clearly seen in 
the mean number of all contacts and in the radius of gy- 
ration (not shown here). It takes place already at much 
higher temperatures and is smeared out over a wide tem- 
perature range. Note that the number of all contacts 
follows closely the behavior of the energy. 

These two transitions are seen more clearly when the 
fluctuations of the above order parameters are consid- 
ered, see Fig. Energy and contact number fluctua- 
tions exhibit a broad maximum around T = 3.5, while 
the structural transition is indicated by a narrow peak of 
the fluctuations of the number of native contacts and of 



FIG. 11. Fluctuations of the number of native contacts, 
total number of contacts, and energy E vs temperature T for 
the HAP sequence of Fig. ^; Since energy fluctuations and 
specific heat emphasize the polymer collapse and structural 
transition differently, we have included the specific heat, too. 



3d HP model 

With PERM we succeeded to reach ground states of 
the ten sequences of length N — 48 given in |3(J in all 
cases, in CPU times between a few seconds and 5 hours, 
see Table II. In these simulations we used a rather sim- 
ple version of PERM, where we started assembly always 
from the same end of the chain. We found that the se- 
quences most difficult to fold were also those which had 
resisted previous Monte Carlo attempts |3(J. In those 
cases where a ground state was hit more than once, we 
verified also that the ground states were highly degen- 
erate. In no case there were gaps between ground and 



first excited states, see Fig. 12. Therefore, none of these 



sequences is a good folder, though they were designed 
specifically for this purpose. 



3d modified HP model 

For the two-species 80-mer with interactions 
(—1,0,-1), even without much tuning our algorithm 
gave E — — 94 after a few hours, but it did not stop 
there. After a number of rather disordered configura- 
tions with successively lower energies, the final candidate 
for the native state has E = — 98. It again has a highly 
symmetric shape, although it does not fit into a 4 x 4 x 5 



box, see Fig. 13, It has twofold degeneracy (the central 
2x2x2 box in the front of Fig. O can be flipped) , and 
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both configurations were actually found in the simula- 
tions. Optimal parameters for the ground state search 
in this model are (3 = 1/kT w 2.0, app = clhh ~ 2, 

TABLE II. PERM performance for the binary se- 
quences from |30|. 



sequence 


-S'min 


Ei b 
-CMC 


^success 


CPU time 


nr. 








(min) 


1 


32 


31 


101 


6.9 


2 


34 


32 


16 


40.5 


3 


34 


31 


5 


100.2 


4 


33 


30 


5 


284.0 


5 


32 


30 


19 


74.7 


6 


32 


30 


24 


59.2 


7 


32 


31 


16 


144.7 


8 


31 


31 


11 


26.6 


9 


34 


31 


1 


1420.0 


10 


33 


33 


16 


18.3 



a Ground state energies JifJ. 

b Previously reached energies with Monte Carlo methods 
@- 

c Number of independent tours in which a ground state 
was hit. 
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FIG. 12. Energy spectrum of the ten sequences given in 
|jo) . More precisely, to emphasize the low-energy part of the 
spectrum, we show the histograms obtained from the spectra 
by multiplying with e B//T , T = 0.334. Note that there are no 
energy gaps in any of these spectra. 

and <ihp ~ —0.13. With these, average times for finding 
E = —94 and E = —98 in new tours are ca. 20 min and 
80 hours, respectively. 

A surprising result is that the monomers are arranged 
in four homogeneous layers in Fig. while they had 
formed only three layers in the putative ground state of 
Fig. ^. Since the interaction should favor the segregation 
of different type monomers, one might have guessed that 
a configuration with a smaller number of layers should 
be favored. We see that this is outweighed by the fact 
that both monomer types can form large double layers 



in the new configuration. Again, our new ground state is 
not 'compact' in the sense of minimizing the surface, and 
hence it also disagrees with the wide spread prejudice 
that native states are compact. 

In terms of secondary structure, the new ground state 
is fundamentally different from the putative ground state 




FIG. 13. Conformation of the "four helix bundle" se- 
quence with E = —98. We propose that this is the actual 
ground state. Its shape is highly symmetric although it does 
not fit into a rectangular box. It is not degenerate except for 
a flipping of the central front 2x2x2 box. 
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FIG. 14. Contact matrix of the structure in Fig. |l^; a 
black point at (i, j) indicates that there is a contact between 
monomer i and monomer j; grey points indicate contacts in 
only one of the two native states, corresponding to the twofold 
degeneracy of the central 2x2x2 box. Note that the lines 
orthogonal to the main diagonal correspond to anti-parallel /3 
sheet secondary structure elements, see e.g. [pO|. 
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FIG. 15. For comparison, the contact matrix of the puta- 
tive ground state of Ref. [^ll in Fig. ^; note that point triples 
close to the diagonal parallel as well as orthogonal to it are 
signatures of 3d helical secondary structure elements, see e.g. 
J50| ; the other points denote tertiary contacts between helices. 
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FIG. 16. Histograms of (top) thermal weight, (middle) ra- 
dius of gyration, Rq, and (bottom) end-to-end distance, R^ e , 
vs energy E for the 80-mer "four helix bundle" at T — 0.75. 



FIG. 17. (top) Average end-to-end distance, R^ e , and ra- 
dius of gyration, Rq, (middle) specific heat per monomer, C v , 
and average energy per monomer, < E >, vs temperature T 
for the 80-mer "four helix bundle". 



T=2.0 
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FIG. 18. Thermally averaged contact matrix for the 
80-mer "four helix bundle" in the random coil phase 
(T — 2.0). Different shades of grey denote different proba- 
bilities for the contact to exist. 
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of Ref. |p8| . While the new structure (Fig. |T^) is domi- 
nated by (3 sheets, which can most clearly be seen in the 
contact matrix (see Fig. |l4|), the structure in Fig. || is 
dominated by helices, see also the corresponding contact 
matrix in Fig. |l5| . 

T=0.85 




10 20 30 40 50 60 70 80 



FIG. 19. Thermally averaged contact matrix for the 
80-mer "four helix bundle" in the collapsed but unstructured 
phase (T = 0.85). 

In order to analyze the folding transition of this se- 
quence we again constructed histograms of the distribu- 
tion of energy, end to end distance, and radius of gyra- 
tion, by combining the results obtained at various tem- 
peratures between T = 0.45 and 3. Figure [l6| shows 
these distributions, reweighted so that it corresponds to 
T = 0.75. The thermal behavior of these order parame- 
ters as functions of T is obtained by Laplace transform, 
and is shown in Fig. The behavior of energy, end-to- 
end distance and radius of gyration follow closely each 
other and exhibit clearly only the smeared out collapse 
of the chain from a random coil to some unstructured 
compact phase. In contrast, the specific heat exhibits 
more structure: the shoulder around T = 1 corresponds 
again to the coil-globule collapse, but there are additional 
transitions seen around T = 0.62 and T — 0.45. The 
last one is the transition to the /3-sheet dominated na- 
tive phase. However, the transition at T — 0.62 is from 
a unstructured globule to an intermediate phase that is 
helix-dominated but exhibits strong tertiary fluctuations. 
These structural transitions are illustrated in Figs |l8| to 
where the thermally averaged contact matrices are 
shown for the respective phases. 

The intermediate, helix-dominated phase is particu- 
larly interesting. To it apply some of the usual charac- 
teristics of a molten globule state [ pi] : i) compactness, 
ii) large secondary structure content (although not neces- 
sarily native), and iii) strong fluctuations. This qualifies 



it as a candidate for a molten globule state, a phase that 
is absent in the folding transition of the HAP sequence 
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FIG. 20. Thermally averaged contact matrix for 

the 80-mer "four helix bundle" in the intermediate he- 
lix-dominated phase (T = 0.5). 



T=0.4 
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FIG. 21. Thermally averaged contact matrix for the 
80-mer "four helix bundle" at the transition from the interme- 
diate helix-dominated phase to the /3-sheet dominated phase 
(T=0.4). 



3d oo monomer types 

For all sequences with N — 27 from p7| we could reach 
the supposed ground state energies within < 1 hour. In 



10 



no case we found energies lower than those quoted in 
p7j] , and we verified also the energies of low-lying ex- 
cited states given in (^tJ. Notice that these sequences 
were designed to be good folders by the authors of [E7| . 



o 

FIG. 
80-mer 
(T = 0. 
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22. Thermally averaged contact matrix for the 
"four helix bundle" in the /3-sheet dominated phase 
3). 



This time the design had obviously been successful, which 
is mainly due to the fact that the number of different 
monomer types is large. All sequences showed some gaps 
between the ground state and the bulk of low- lying states, 
although these gaps are not very pronounced in some 
cases. 

More conspicuous than these gaps was another fea- 
ture: all low lying excited states were very similar to 
the ground state, as measured by the fraction of contacts 
which existed also in the native configuration. Stated dif- 
ferently, if the gaps were not immediately obvious, this 
was because they were filled by configurations which were 
very similar to the ground state and can therefore easily 
transform into the native state and back. Such states 
therefore cannot prevent a sequence from being a good 
folder. For none of the sequences of |p7| we found truely 
misfolded low-lying states with small overlap with the 
ground state. 

Figure illustrates this feature for one particular se- 
quence. There we show the overlap Q, defined as the frac- 
tion by non-bonded nearest-neighbor ground state con- 
tacts which exist also in the excited state, against the 
excitation energy. For this and for each of the follow- 
ing figures, the 500 lowest-lying states were determined. 
We see no low energy state with a small value of Q. To 
demonstrate that this is due to design, and is not a prop- 
erty of random sequences with the same potential distri- 
bution, we show in Fig. ^ the analogous distribution for 
a random sequence. 
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FIG. 23. Overlap with ground state, Q, vs energy E of 
the lowest energy conformations for sequence no. 70 of Ref. 
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FIG. 24. Overlap with ground state, Q, vs energy E of the 
lowest energy conformations for a single random sequence. 



To demonstrate that this difference is not merely due 
to a statistical fluctuation, we show in Fig. |25| the distri- 
butions for ten sequences from collected in a single 
plot. Since the ground state energies differ considerably 
for different sequences, we used normalized excitation en- 
ergies (E — Eq)/Eq on the x-axis. Analogous results for 
ten random sequences are shown in Fig. ^6|. While there 
is no obvious correlation between Q and excitation ener- 
gies for the random case, all low energy states with small 
Q have been eliminated in the designed sequences (note 
the different ranges of Q in Figs. ^ and p6| ). 

This elimination of truely misfolded low energy states 
without elimination of native-like low energy states might 
be an unphysical property of the design procedure used in 
p7f, but we do not believe that this is the case. Rather, 



11 



1.0 o- 



0.9 - 




0.00 0.05 0.10 

(E-E„)/E„ 

FIG. 25. Overlap with ground state, Q, vs energy E of 
the lowest energy conformations for sequences no. 61 to 70 of 
Ref. |^|; for better visibility, the same symbol is used for all 
sequences. 
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FIG. 26. Overlap with ground state, Q, vs energy of the 
lowest energy conformations for ten random sequences; for 
better visibility, the same symbol is used for all sequences. 

it should be a general feature of any design procedure, in- 
cluding the one due to biological evolution. It contradicts 
the claim of [^6| that it is only the gap between native 
and first excited state which determines foldicity. On the 
other hand, our results are consistent with the "funnel" 



scenario for the protein folding process |53|, where the 
folding pathway consists of states successively lower in 
energy and closer to the native state. 

We note that for random sequences there are also ex- 
cited states that have unit overlap with the native state, 
a feature not present in the folding sequences. These 
are cases where the native state has open loops and/or 
dangling ends, so that more compact conformations have 
all contacts of the native state, but have - in addition 
- energetically unfavorable contacts resulting in a higher 
total energy. 



SUMMARY AND OUTLOOK 



We showed that the pruned-enriched Rosenbluth 
method (PERM) can be very effectively applied to pro- 
tein structure prediction in simple lattice models. It is 
suited for calculating statistical properties and is very 
successful in finding native states. In all cases it did bet- 
ter than any previous MC method, and in several cases 
it found lower energy states than those which had previ- 
ously been conjectured to be native. 

We verified that ground states of the HP model are 
highly degenerate and have no gap, leading to bad fold- 
ers. For sequences that are good folders we have estab- 
lished a funnel structure in state space: low-lying excited 
states of well-folding sequences have strong similarities 
to the ground state, while this is not true for non-folders 
with otherwise similar properties. 

Especially, we have presented a new candidate for 
the native configuration of a "four helix bundle" se- 
quence which had been studied before by several authors. 
The ground state structure of the "four helix bundle" 
sequence, being actually foeta-sheet dominated, differs 
strongly from the helix-dominated intermediate phase. 
This sequence, therefore, should not be a good folder. 

Although we have considered only lattice models in 
this paper, we should stress that this is not an inher- 
ent limitation of PERM. Straightforward extensions to 
off-lattice systems are possible and are efficient for ho- 
mopolymers at relatively high temperatures Wm . Pre- 
liminary attempts to study off-lattice heteropolymers at 
low T have not yet been particularly successful, but the 
inherent flexibility of PERM suggests several modifica- 
tions which have not yet been investigated in detail. One 
of them are hybrid PERM-Metropolis approaches simi- 
lar to that used for the HAP model in the present pa- 
per. Its success also suggests that similar hybrid ap- 
proaches should be useful for models with more compli- 
cated monomers. Another improvement of PERM which 
could be particularly useful for off-lattice simulations 
might consist in more sophisticated algorithms for posi- 
tioning the monomers when assembling the chain. Work 
along these lines is in progress, and we hope to report on 
it soon. 
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